---
layout: home
author_profile: true
header:
title: "Clustering sub-district in Bangkok, Thailand"
categories:
  - Blog
tags:
  - Foursquare API
  - location data
---
Contents:
- Methodology
- Result
Bangkok is the capital and most populous city of Thailand. It is known in Thai as Krung Thep Maha Nakhon or simply Krung Thep. The city occupies 1,568.7 square kilometers (605.7 sq mi) in the Chao Phraya River delta in central Thailand and has an estimated population of 10.539 million as of 2020, 15.3 percent of the country's population. Over fourteen million people (22.2 percent) lived within the surrounding Bangkok Metropolitan Region at the 2010 census, making Bangkok an extremely primate city, dwarfing Thailand's other urban centers in both size and importance to the national economy.
How to find a suitable restaurant business location in Bangkok?
Bangkok has the highest population density in Thailand, with many buildings and meeting places. It is therefore attractive to investors looking to open a business, particularly a restaurant, but finding a suitable location is difficult. The stakeholders' problem is to find a place with low restaurant density and high population density. In this project, we use data science methods to analyze the data and present the results for stakeholders to consider as part of their business decision-making.
Investors and other stakeholders can take this analysis into account as part of their investment considerations. After analyzing the available data with machine learning methods, we create a map that visualizes the population density and restaurant density of each district.
The data I use includes:
- **Population data** - population figures for each district in Bangkok from the Thailand Digital Government Development Agency (DGA), used to calculate population density.
- **Bangkok boundary coordinates** - I searched for a Bangkok boundary (GeoJSON) file to create a choropleth map, and found the Thailand administrative regions dataset on data.humdata.org as a shapefile (.shp) containing the coordinates of all of Thailand's provinces. I selected only the properties I needed, Bangkok's coordinates, and saved them as a GeoJSON file to reduce the file size and make the data easier to work with.
- **Sub-district location coordinates** - the Bangkok dataset from the DGA, which contains the latitude and longitude of each sub-district in Bangkok.
- **Foursquare API** - used to retrieve the restaurants around a given center-point coordinate for each sub-district of Bangkok.
I divide the data into two parts. The first part uses the population data and Bangkok boundary coordinates to calculate population density. The second part uses the sub-district location coordinates as input to the Foursquare API to find restaurants within a 500-meter radius of each coordinate.
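Both measures reduce to simple formulas, sketched below as a quick sanity check (the helper names are mine, not from the notebook): population density is people divided by area in square meters, and restaurant density is the restaurant count divided by the area of the 500-meter search circle.

```python
import math

def population_density(population, area_km2):
    """People per square meter; area converted from km^2 to m^2."""
    return population / (area_km2 * 1e6)

def restaurant_density(count, radius_m=500):
    """Restaurants per square meter of the circular search area."""
    return count / (math.pi * radius_m ** 2)

# e.g. 180,000 people in a 12.5 km^2 district:
print(population_density(180_000, 12.5))   # 0.0144 people per m^2
# e.g. 20 restaurants found within a 500 m radius:
print(restaurant_density(20))              # ~2.55e-05 restaurants per m^2
```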
```python
#@title
import numpy as np
import pandas as pd
import requests
import json
import seaborn as sns

# use the inline backend to generate the plots within the browser
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot')  # optional: for ggplot-like style
from matplotlib.patches import Patch
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means and mean-shift from the clustering stage
from sklearn.cluster import KMeans
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
from sklearn import preprocessing

!pip install -q geopy
from geopy.geocoders import Nominatim  # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image
from IPython.core.display import HTML

# transforming a json file into a pandas dataframe
from pandas.io.json import json_normalize

#!pip install -q folium==0.5.0
import folium  # plotting library

!pip install -q django
from django.contrib.gis.geos import Polygon  # convert UTM coordinates into latitude and longitude values
from pprint import pprint

# geopandas extends the datatypes used by pandas to allow spatial operations on geometric types
!pip install -q geopandas
import geopandas as gpd

# a utility module to deal with colormaps (aliased to avoid shadowing matplotlib.cm)
import branca.colormap as bcm

# add a Thai font
#import matplotlib.font_manager as fm
#font_list = fm.createFontList(['THSarabunNew.ttf'])
#fm.fontManager.ttflist.extend(font_list)
# set font
#plt.rcParams['font.family'] = 'TH Sarabun New'
#plt.rcParams['xtick.labelsize'] = 20.0
#plt.rcParams['ytick.labelsize'] = 20.0
```
```python
url = "https://raw.githubusercontent.com/momijizen/Coursera_Capstone/main/sub_district_location_22.csv"
df_province = pd.read_csv(url)
df_province.head()

# select only Bangkok city
df_province = df_province[df_province['CHANGWAT_E'] == 'Bangkok']
df_province.reset_index(drop=True, inplace=True)
# select columns
df_province = df_province[['TA_ID','TAMBON_T','TAMBON_E','AM_ID','AMPHOE_T','AMPHOE_E','LAT','LONG']]
# rename columns
df_province.columns = ['sub_district_id','sub_district_th','sub_district_eng','district_id','district_th','district_eng','latitude','longitude']
df_province = df_province.astype({'sub_district_id': str, 'district_id': str})
df_province.head()

path = "https://raw.githubusercontent.com/momijizen/Coursera_Capstone/main/population.csv"
df_population = pd.read_csv(path)
df_population.head()

# select only the 2018 population (Buddhist year 2561, hence the 'population61' column)
df_district_population = df_population[['dcode','population61']]
df_district_population.columns = ['district_id','population']
df_district_population = df_district_population.astype({'district_id': str})
df_district_population.set_index('district_id', inplace=True)
df_district_population.head()

# group by district
df_district = df_province.groupby(by=['district_id','district_eng','district_th']).count().reset_index()
df_district = df_district[['district_id','district_eng','district_th']]
df_district.set_index('district_id', inplace=True)
# join districts with population
df_district_population = df_district_population.merge(df_district, how='inner', left_index=True, right_index=True).reset_index()
df_district_population.head()
```
Unzip the Thailand administrative boundary shapefiles.

```python
import zipfile as zf
files = zf.ZipFile("th_borough.zip", 'r')
files.extractall('directory to extract')
files.close()
```
Use the geopandas library to read the shapefile into a dataframe.

```python
# read shapefile
#fname = 'th_borough/tha_admbnda_adm2_rtsd_20190221.shp'
fname = 'tha_admbnda_adm3_rtsd_20190221.shp'
tha = gpd.read_file(fname)
#tha.crs = "epsg:4326"
tha.head()

# select only the Bangkok shapes
bangkok = tha[tha['ADM1_EN'] == 'Bangkok']
# select columns
bangkok = bangkok[['Shape_Leng','Shape_Area','ADM3_PCODE','ADM3_EN','ADM3_TH','geometry']]
# slice the 'TH' prefix off the PCODE column
bangkok['ADM3_PCODE'] = bangkok['ADM3_PCODE'].str.slice(2,)
bangkok.reset_index(drop=True, inplace=True)
bangkok.head()
```
Save it as a GeoJSON file so that future runs can skip these steps and avoid loading the large shapefile.

```python
# save the Bangkok geojson file
bangkok.to_file("bangkok_district.geojson", driver='GeoJSON')
```
Bangkok districts:

```python
#!wget -q 'https://raw.githubusercontent.com/momijizen/Coursera_Capstone/main/bangkok_district.geojson'
#bkk2 = json.load(open('bangkok_district.geojson'), encoding='utf-8')
#df = json_normalize(bkk2["features"])
district_geojson_path = 'https://raw.githubusercontent.com/momijizen/Coursera_Capstone/main/bangkok_district.geojson'
district_geojson = gpd.read_file(district_geojson_path)
district_geojson.head()
```
Bangkok sub-districts:

```python
sub_district_geojson_path = 'https://raw.githubusercontent.com/momijizen/Coursera_Capstone/main/bangkok_sub_district.geojson'
sub_district_geojson = gpd.read_file(sub_district_geojson_path)
sub_district_geojson.head()
```
Make sure the ADM2_PCODE values match the district ids in df_district_population.

```python
geo_district = district_geojson['ADM2_PCODE'].tolist()
district_id = df_district_population['district_id'].tolist()
print(list(set(district_id) - set(geo_district)))
print(list(set(geo_district) - set(district_id)))
```
Join the shape area with the district population.

```python
df = district_geojson[['Shape_Area','ADM2_PCODE']]
df.columns = ['shape_area','district_id']
df = df.astype({'district_id': str})
df.head()

df_district_population = df_district_population.merge(df, on='district_id')
df_district_population.head()
```
The shape area is in square kilometers; convert it to square meters and calculate the number of people per square meter.

```python
df_district_population['pop_density'] = df_district_population['population'] / (df_district_population['shape_area'] * 1e6)
df_district_population.head()
```
Merge the pop_density column into the df_province dataframe.

```python
df_province = df_province.merge(df_district_population[['district_id','pop_density']], on='district_id', how='left')
```
Merge the pop_density column into the sub_district_geojson geodataframe.

```python
sub_district_geojson.columns = ['sub_district_id','sub_district_eng','sub_district_th','district_id','district_eng','district_th','geometry']
sub_district_geojson = sub_district_geojson.merge(df_district_population[['district_id','pop_density']], on='district_id', how='left')
```
Define Foursquare credentials and version.

```python
# never publish real credentials; substitute your own values here
CLIENT_ID = 'YOUR_CLIENT_ID'          # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'  # your Foursquare Secret
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'    # your Foursquare Access Token
VERSION = '20180604'
LIMIT = 500
search_query = 'restaurant'
radius = 500
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)
```
Create a function to find the venues around each sub-district center.

```python
def getNearbyVenues(sub_district_id, district_id, latitudes, longitudes, radius=500):
    venues_list = []
    for sub_district_id, district_id, lat, lng in zip(sub_district_id, district_id, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            sub_district_id,
            district_id,
            lat,
            lng,
            v['venue']['id'],
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['location']['distance'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['sub_district_id', 'district_id',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Id',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Distance',
                             'Venue Category']
    return nearby_venues
```
```python
df_venues = getNearbyVenues(sub_district_id=df_province['sub_district_id'],
                            district_id=df_province['district_id'],
                            latitudes=df_province['latitude'],
                            longitudes=df_province['longitude'])
df_venues_new = df_venues
print(df_venues.shape)
df_venues.head(10)

# venues can fall inside more than one search circle; keep only the closest match
df_venues = df_venues.sort_values(by='Venue Distance').drop_duplicates(subset=['Id'], keep='first')
df_venues.shape

# keep only venues whose category contains 'restaurant'
df_restaurant = df_venues[df_venues['Venue Category'].str.lower().str.contains('restaurant')].reset_index(drop=True)
df_restaurant.shape
```
Display the restaurant category counts.

```python
category_restaurant = df_restaurant.groupby(by='Venue Category')['Venue Category'].count().reset_index(name='count').sort_values(by='count', ascending=False)
print(category_restaurant.shape)
category_restaurant

category_restaurant.sort_values(by='count', ascending=False).plot(
    kind='bar', x='Venue Category', y='count', figsize=(15, 5))
#plt.xticks(rotation=70)
plt.ylabel('Number of Restaurants')
plt.xlabel('Restaurant Category')
plt.title('Number of Restaurants per Category')

restaurant_group = df_restaurant.groupby(by='sub_district_id')['Venue Category'].count().reset_index(name='restaurant_count').sort_values(by='restaurant_count', ascending=False)
restaurant_group.head()
```
Divide by the area of a circle with a 500-meter radius to get restaurants per square meter.

```python
restaurant_group['rest_density'] = restaurant_group['restaurant_count'] / (np.pi * 500 * 500)
restaurant_group.head()

df_province.shape
df_province = df_province.merge(restaurant_group, on='sub_district_id', how='left')
df_province

# sub-districts with no restaurants found get a density of zero
df_province[['restaurant_count','rest_density']] = df_province[['restaurant_count','rest_density']].fillna(0)
df_province.info()
```
Merge the rest_density column into the sub_district_geojson geodataframe.

```python
sub_district_geojson = sub_district_geojson.merge(df_province[['sub_district_id','rest_density']], on='sub_district_id', how='left')
```
One-hot encode the restaurant categories and take the mean per sub-district.

```python
restaurant_onehot = pd.get_dummies(df_restaurant['Venue Category'])
restaurant_onehot['sub_district_id'] = df_restaurant['sub_district_id']
restaurant_group_mean = restaurant_onehot.groupby('sub_district_id').mean().reset_index()
```
Define a function that sorts a sub-district's restaurant categories in descending order.

```python
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 1
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['sub_district_id']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Restaurant Category'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Restaurant Category'.format(ind+1))

# create a new dataframe
restaurant_most = pd.DataFrame(columns=columns)
restaurant_most['sub_district_id'] = restaurant_group_mean['sub_district_id']
for ind in np.arange(restaurant_group_mean.shape[0]):
    restaurant_most.iloc[ind, 1:] = return_most_common_venues(restaurant_group_mean.iloc[ind, :], num_top_venues)
restaurant_most
```
Merge the 1st Most Restaurant Category column into the df_province dataframe.

```python
df_province = df_province.merge(restaurant_most, on='sub_district_id', how='left')
df_province.head()
df_province.info()
```
```python
fig = plt.figure()              # create figure
ax0 = fig.add_subplot(1, 2, 1)  # add subplot 1 (1 row, 2 columns, first plot)
ax1 = fig.add_subplot(1, 2, 2)  # add subplot 2 (1 row, 2 columns, second plot)

# Subplot 1: histogram
count, bin_edges = np.histogram(df_province['pop_density'])
df_province['pop_density'].plot(kind='hist', figsize=(15, 6), color='#6D48B6', xticks=bin_edges, ax=ax0)
ax0.set_title('Histogram of population density')
ax0.set_ylabel('Number of sub-districts')
ax0.set_xlabel('population density')

# Subplot 2: box plot
df_province['pop_density'].plot(kind='box', figsize=(15, 6), color='#6D48B6', ax=ax1)
ax1.set_title('Box plot of population density')
ax1.set_ylabel('population density')
plt.show()

fig = plt.figure()
ax0 = fig.add_subplot(1, 2, 1)
ax1 = fig.add_subplot(1, 2, 2)

# Subplot 1: histogram
count, bin_edges = np.histogram(df_province['rest_density'])
df_province['rest_density'].plot(kind='hist', figsize=(15, 6), color='#CF2F49', xticks=bin_edges, ax=ax0)
ax0.set_title('Histogram of restaurant density')
ax0.set_ylabel('Number of sub-districts')
ax0.set_xlabel('restaurant density')

# Subplot 2: box plot
df_province['rest_density'].plot(kind='box', figsize=(15, 6), color='#CF2F49', ax=ax1)
ax1.set_title('Box plot of restaurant density')
ax1.set_ylabel('restaurant density')
plt.show()

sns.jointplot(x='pop_density', y='rest_density', data=df_province, height=8)
#plt.title('population density and restaurant density')
plt.xlabel('population density')
plt.ylabel('restaurant density')
```
Use the geopy library to get the latitude and longitude of Bangkok.

```python
address = 'Bangkok, Thailand'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)
```
```python
bangkok_geo = sub_district_geojson  # geodataframe read from 'bangkok_sub_district.geojson'

# create a plain province map
bangkok_map = folium.Map(location=[latitude, longitude], zoom_start=10)
# use different map tiles (OpenStreetMap, CartoDB, Stamen, Mapbox, ...)
folium.TileLayer('CartoDB positron', name="Light Map", control=False).add_to(bangkok_map)

# generate a choropleth map using the population density of each sub-district
bangkok_map.choropleth(
    geo_data=bangkok_geo,
    data=df_province,
    columns=['sub_district_id','pop_density'],
    key_on='feature.properties.sub_district_id',
    fill_color='YlGnBu',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='people per square meter of each sub-district'
)
# display map
#bangkok_map

style_function = lambda x: {'fillColor': '#ffffff',
                            'color': '#000000',
                            'fillOpacity': 0.1,
                            'weight': 0.1}
highlight_function = lambda x: {'fillColor': '#000000',
                                'color': '#000000',
                                'fillOpacity': 0.50,
                                'weight': 0.1}
NIL = folium.features.GeoJson(
    sub_district_geojson,
    style_function=style_function,
    control=False,
    highlight_function=highlight_function,
    tooltip=folium.features.GeoJsonTooltip(
        fields=['sub_district_eng','pop_density','rest_density'],
        aliases=['Sub-district name: ','People per square meter: ','Restaurants per square meter: '],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;")
    )
)
bangkok_map.add_child(NIL)
bangkok_map.keep_in_front(NIL)
folium.LayerControl().add_to(bangkok_map)
bangkok_map
```
```python
# standardize both features to z-scores before clustering
df_province['pop_density_zscore'] = (df_province['pop_density'] - df_province['pop_density'].mean()) / df_province['pop_density'].std()
df_province['rest_density_zscore'] = (df_province['rest_density'] - df_province['rest_density'].mean()) / df_province['rest_density'].std()

test_clustering = df_province[['sub_district_id','rest_density_zscore','pop_density_zscore']]
test_clustering.set_index('sub_district_id', inplace=True)
```
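The manual z-scoring above can also be done with scikit-learn's StandardScaler. A small illustrative sketch on toy data (not from the notebook); note that pandas' `.std()` uses ddof=1 while StandardScaler uses the population standard deviation (ddof=0), so the values differ slightly:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy stand-in for the two density columns
df = pd.DataFrame({'pop_density': [0.01, 0.02, 0.03],
                   'rest_density': [1e-5, 2e-5, 4e-5]})

manual = (df - df.mean()) / df.std()          # ddof=1, as in the notebook
scaled = StandardScaler().fit_transform(df)   # ddof=0

print(manual['pop_density'].values)  # [-1.  0.  1.]
print(scaled[:, 0])                  # [-1.2247...  0.  1.2247...]
```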
```python
# elbow method: within-cluster sum of squares for k = 1..19
Sum_of_squared_distances = []
K = range(1, 20)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(test_clustering)
    Sum_of_squared_distances.append(km.inertia_)

plt.figure(figsize=(8, 8))
plt.plot(K, Sum_of_squared_distances, 'bx-')
# mark the chosen elbow at k = 5
plt.plot([5, 5], [0, 200], 'k--', lw=2)
plt.text(5.3, 150, 'k = 5', fontsize=16)
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
```
```python
# set number of clusters
kclusters = 5
grouped_clustering = df_province[['sub_district_id','rest_density_zscore','pop_density_zscore']]
grouped_clustering.set_index('sub_district_id', inplace=True)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

#df_province.drop('Cluster Labels (K-mean)', axis=1, inplace=True)
df_cluster = df_province
df_cluster.insert(0, 'Cluster Labels (K-mean)', kmeans.labels_)
df_cluster.head()
```
```python
plt.figure(figsize=(7, 7))
sns.scatterplot(x='pop_density_zscore', y='rest_density_zscore', hue='Cluster Labels (K-mean)',
                palette=sns.color_palette("hls", 5), data=df_province, legend="full")
plt.title('K-means clustering')
plt.xlabel('population density (z-score)')
plt.ylabel('restaurant density (z-score)')
plt.show()
```
```python
grouped_clustering = df_province[['sub_district_id','rest_density_zscore','pop_density_zscore']]
grouped_clustering.set_index('sub_district_id', inplace=True)

# estimate the kernel bandwidth from the data, then run mean-shift
bandwidth = estimate_bandwidth(grouped_clustering, quantile=0.2, n_samples=170)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(grouped_clustering)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)

df_cluster = df_province
df_cluster.insert(0, 'Cluster Labels (Mean-shift)', ms.labels_)
df_cluster.head()

plt.figure(figsize=(7, 7))
sns.scatterplot(x='pop_density_zscore', y='rest_density_zscore', hue='Cluster Labels (Mean-shift)',
                palette=sns.color_palette("hls", 6), data=df_province, legend="full")
plt.title('Mean-shift clustering')
plt.xlabel('population density (z-score)')
plt.ylabel('restaurant density (z-score)')
plt.legend(loc='upper right', title='Cluster Labels')
plt.show()
```
```python
f, axs = plt.subplots(1, 2, figsize=(10, 7), gridspec_kw=dict(width_ratios=[5, 5]))

# right subplot: mean-shift clusters
sns.scatterplot(x='pop_density_zscore', y='rest_density_zscore', hue='Cluster Labels (Mean-shift)',
                palette=sns.color_palette("hls", 6), data=df_province,
                legend="full", ax=axs[1])
axs[1].set_ylabel('restaurant density (z-score)')
axs[1].set_xlabel('population density (z-score)')
axs[1].set_title('Mean-shift clustering')
axs[1].legend(loc='upper right', title='Cluster Labels')

# left subplot: k-means clusters
sns.scatterplot(x='pop_density_zscore', y='rest_density_zscore', hue='Cluster Labels (K-mean)',
                palette=sns.color_palette("hls", 5), data=df_province,
                legend="full", ax=axs[0])
axs[0].set_ylabel('restaurant density (z-score)')
axs[0].set_xlabel('population density (z-score)')
axs[0].set_title('K-means clustering')
axs[0].legend(loc='upper right', title='Cluster Labels')
f.tight_layout()
```
```python
sub_district_geojson = sub_district_geojson.merge(
    df_province[['sub_district_id','Cluster Labels (Mean-shift)','Cluster Labels (K-mean)','1st Most Restaurant Category']],
    on='sub_district_id', how='left')

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

sub_bangkok_geo = sub_district_geojson  # geodataframe read from 'bangkok_sub_district.geojson'

# create a plain province map
kmean_map = folium.Map(location=[latitude, longitude], zoom_start=13)
# use different map tiles (OpenStreetMap, CartoDB, Stamen, Mapbox, ...)
folium.TileLayer('OpenStreetMap', name="Light Map", control=False).add_to(kmean_map)

# generate a choropleth map using the population density of each sub-district
kmean_map.choropleth(
    geo_data=sub_bangkok_geo,
    data=df_province,
    columns=['sub_district_id','pop_density'],
    key_on='feature.properties.sub_district_id',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='population per square meter of each sub-district'
)

# set color scheme for the clusters
x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_province['latitude'], df_province['longitude'], df_province['sub_district_eng'], df_province['Cluster Labels (K-mean)']):
    label = folium.Popup(str(poi) + ' Cluster Labels (K-mean) ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],       # cluster labels are 0-based
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(kmean_map)

style_function = lambda x: {'fillColor': '#ffffff',
                            'color': '#000000',
                            'fillOpacity': 0.1,
                            'weight': 0.1}
highlight_function = lambda x: {'fillColor': '#000000',
                                'color': '#000000',
                                'fillOpacity': 0.50,
                                'weight': 0.1}
BKK = folium.features.GeoJson(
    sub_district_geojson,
    style_function=style_function,
    control=False,
    highlight_function=highlight_function,
    tooltip=folium.features.GeoJsonTooltip(
        fields=['sub_district_eng','rest_density','pop_density','Cluster Labels (K-mean)','1st Most Restaurant Category'],
        aliases=['Sub-district name: ','Restaurants per square meter: ','People per square meter: ','Cluster Label (K-mean): ','Most common restaurant category: '],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;")
    )
)
kmean_map.add_child(BKK)
kmean_map.keep_in_front(BKK)
folium.LayerControl().add_to(kmean_map)
kmean_map
```
```python
sub_bangkok_geo = sub_district_geojson  # geodataframe read from 'bangkok_sub_district.geojson'

# create a plain province map
sub_bangkok_map = folium.Map(location=[latitude, longitude], zoom_start=13)
# use different map tiles (OpenStreetMap, CartoDB, Stamen, Mapbox, ...)
folium.TileLayer('OpenStreetMap', name="Light Map", control=False).add_to(sub_bangkok_map)

# generate a choropleth map using the population density of each sub-district
sub_bangkok_map.choropleth(
    geo_data=sub_bangkok_geo,
    data=df_province,
    columns=['sub_district_id','pop_density'],
    key_on='feature.properties.sub_district_id',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='population per square meter of each sub-district'
)

# set color scheme for the clusters
x = np.arange(6)
ys = [i + x + (i*x)**2 for i in range(6)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_province['latitude'], df_province['longitude'], df_province['sub_district_eng'], df_province['Cluster Labels (Mean-shift)']):
    label = folium.Popup(str(poi) + ' Cluster Labels (Mean-shift) ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],       # cluster labels are 0-based
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(sub_bangkok_map)

style_function = lambda x: {'fillColor': '#ffffff',
                            'color': '#000000',
                            'fillOpacity': 0.1,
                            'weight': 0.1}
highlight_function = lambda x: {'fillColor': '#000000',
                                'color': '#000000',
                                'fillOpacity': 0.50,
                                'weight': 0.1}
BKK = folium.features.GeoJson(
    sub_district_geojson,
    style_function=style_function,
    control=False,
    highlight_function=highlight_function,
    tooltip=folium.features.GeoJsonTooltip(
        fields=['sub_district_eng','rest_density','pop_density','Cluster Labels (Mean-shift)','1st Most Restaurant Category'],
        aliases=['Sub-district name: ','Restaurants per square meter: ','People per square meter: ','Cluster Label (Mean-shift): ','Most common restaurant category: '],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;")
    )
)
sub_bangkok_map.add_child(BKK)
sub_bangkok_map.keep_in_front(BKK)
folium.LayerControl().add_to(sub_bangkok_map)
sub_bangkok_map
```
The problem is to cluster the sub-districts of Bangkok, and I selected two methods. The first is K-Means clustering, the most popular method; however, with a small dataset (170 samples) it is difficult to select a suitable k. I therefore also chose a second method, Mean-Shift clustering, which suits a small dataset and does not require setting the number of clusters in advance.
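Beyond the elbow plot, the choice of k can be cross-checked with a silhouette score, where higher values indicate better-separated clusters. The sketch below is illustrative only: it runs on synthetic blobs standing in for the 170 z-scored density samples, not on the project data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic stand-in for the 170 z-scored samples
X, _ = make_blobs(n_samples=170, centers=5, cluster_std=0.6, random_state=0)

for k in (3, 5, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```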
However, the dataset fed to the models has only two features, which is very limited. To improve this project in the future, I plan to add more analytical features.
In this study, I cluster the sub-districts of Bangkok to inform investors who want to open a business, especially a restaurant. It considers two features: population density and restaurant density. However, the restaurant data comes from the Foursquare API, which only returns restaurants listed on the Foursquare platform, so many offline restaurants are not considered. This is an important area for future improvement. Even so, the study remains useful to its creators and to anyone interested in taking this concept to the next level.